misconception students have is active balancing, and they oppose this
mechanism with the notion of "swamping." Current research suggests 
that such an approach is likely to be unfruitful because the problem 
is not that students think in terms of an incorrect process mechanism 
Abstract

There is a growing body of evidence indicating that people often
overestimate the similarity between characteristics of random samples and
those of the populations from which they are drawn. In the first section of
the paper, we review some studies that have attempted to determine whether the
basic heuristic employed in thinking about random samples is passive and
descriptive or whether it is deducible from a belief in active balancing. In
the second section, we discuss the importance of sample size on judgments
about the characteristics of random samples.
Statistical Reasoning in Novices

We have been conducting research on intuitions about statistical concepts
for several years, in large part because we believe that statistical
reasoning is a very important kind of thinking, but also because we are
responsible for teaching a number of statistics courses. There is a great
deal of uncertainty associated with the data underlying most branches of
science. Empirical data are characterized by measurement error, and for many
problems the evidence or information required is known with varying degrees of
confidence. The methodology used almost universally for dealing with this
uncertainty employs the model of probability theory together with a variety of
supposedly normative procedures for making predictions and decisions. An
important goal of a course in statistics is to provide the student with
sufficient skills and knowledge to be able to make reasonable judgments in the
face of uncertain information from various sources: e.g., experimental data,
the research literature, and such popular sources as newspapers and magazines.

Unfortunately, the standard undergraduate statistics course aimed at
social science majors often does not seem to provide adequate skills or
understanding. Most undergraduates coming out of such a course do not
understand the basic concepts well enough to generalize to situations not
explicitly covered in the course, and we have found that they frequently have
trouble even with those situations that were explicitly covered. For example,
many students do not fully understand even such basic concepts as the mean
(e.g., Pollatsek, Lima, & Well, 1981). Many students think of the mean only
in terms of a computational algorithm and consequently make predictable kinds
of mistakes in attempting to solve weighted mean problems. Further research
has shown that students who have had an introductory statistics course are
little better at these problems than those who have not. Furthermore,
students are often unable to explain exactly what can and cannot be concluded
from the procedures learned in the course.

We believe that the major reason the standard undergraduate statistics
course is not as successful as we would like is that generally no explicit
effort is made to assess a priori and appropriately modify the cognitive
structures of the student. Courses that emphasize calculations and those
that emphasize mathematical derivations usually ignore the issue of "basic
understanding." However, even an attempt to use an intuitive approach that
emphasizes understanding of basic concepts and principles can be frustrating,
since we have only recently started to understand the intuitions and
preconceptions that the student brings to the statistics class. Given that
the instructor has far more experience with the concepts and methods of
statistics than the student, it is possible that organizing the content in the
way that seems most logical to the instructor may not be the best way of
encouraging understanding by the student. In our opinion, it is necessary to
know the preconceptions and kinds of thinking that characterize the cognitive
structures of the students and what structures characterize different levels
of understanding. From such information, the instructor can plan a course
that is within the grasp of students and yet serves to achieve the desired
level of expertise.

In the present paper, we will review some of the work that we and others
have done to try to understand some of the intuitions that people have about
a very important concept in statistics, namely, random sampling.

Representativeness versus Active Balancing
There is at present a large body of evidence indicating that novices
believe that random samples resemble the population from which they are drawn.
If the sample size is sufficiently large, then a random sample will, in fact,
be similar to the parent population. Where the typical novice differs from
the normative model is that, at least under certain conditions, he or she
believes that small as well as large samples have a high probability of
looking like the population. Tversky and Kahneman (1971) have dubbed this
misconception "The Law of Small Numbers." They proposed that a heuristic
called "representativeness" underlies this misconception. "A person who
follows this belief evaluates the probability of an uncertain event, or a
sample, by the degree to which it is: (1) similar in essential properties to
its parent population; and (2) reflects the salient features of the process by
which it is generated" (Kahneman & Tversky, 1972, p. 431).

One source of evidence for this misconception has come from investigation
of what is generally known as the "gambler's fallacy." A simple example of
the gambler's fallacy is the belief that if a fair coin has come up heads a
large number of times in a row, then there is an increased chance that it will
come up tails on the next toss. The gambler's fallacy can be described as the
belief that in random sampling, the data that have already been sampled will
influence the data that are yet to be sampled. This, of course, violates
independence, which is a fundamental property of true random sampling. In
real-life coin tossing, shaking the coin well between tosses would guarantee
some reasonable approximation to independence.

The prototypical problem used by Tversky and Kahneman (1971) to explore
the gambler's fallacy was as follows:

The mean IQ of the population of eighth graders in a city
is known to be 100. You have selected a random sample of
50 children for a study of educational achievements. The
first child tested has an IQ of 150. What do you expect
the mean IQ to be for the whole sample?

If the sampling were random, then the best guess for the mean score of the
next 49 children sampled is 100. Therefore the best guess for the mean of the
entire sample of 50 children is 101, the weighted mean of 150 and 100.
However, the typical answer given to this problem is 100. Answering "100" is
consistent with the gambler's fallacy because it seems to imply that the score
of the first child chosen influences the mean of the scores of the next 49.
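
The weighted-mean arithmetic behind the normative answer can be checked
directly. The short Python sketch below is an illustration added here, not
part of the original study materials; the function name
expected_sample_mean is hypothetical, and the calculation assumes only that
each unobserved score has an expected value equal to the population mean.

```python
def expected_sample_mean(known_scores, population_mean, sample_size):
    """Best guess for the mean of a random sample when some scores are known.

    Under independent random sampling, every score not yet observed has an
    expected value equal to the population mean, whatever the scores
    already observed happen to be.
    """
    n_unknown = sample_size - len(known_scores)
    total = sum(known_scores) + n_unknown * population_mean
    return total / sample_size


# IQ problem: one known score of 150, population mean 100, sample of 50.
print(expected_sample_mean([150], 100, 50))  # 101.0
```

The single extreme score carries a weight of 1/50 and the remaining 49
scores a weight of 49/50, which is why the estimate moves only one point
away from the population mean.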

Kahneman and Tversky (1972) and Bar-Hillel (1980) have employed a second
paradigm to demonstrate the heuristic of representativeness. Subjects were
shown two samples and asked to judge which was more likely to have occurred.
In their original work, Kahneman and Tversky (1972) dealt with events
modelled by Bernoulli trials. They found, for example, that subjects thought
that for a sequence of six births, the exact order G B G B G B is more likely
than the order B G B B B B, presumably because the sequence with five
boys and one girl fails to reflect the proportion of boys and girls in the
population. Subjects also thought that a sequence like B B B G G G was less
probable than a sequence like G B B G B G, presumably because it seems less
random. Bar-Hillel (1980) has extended this research to determine which
characteristics of samples subjects are attending to when they judge the
occurrence of one sample to be more or less probable than that of another.
She found that subjects think that a sample should have not only the same mean
as the population, but also the same degree of variability.
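
For reference, the normative answer to the birth-order questions is that
every exact order of six births is equally likely. The sketch below is an
added illustration; it assumes independent births with a probability of
exactly one half for a boy, which is an idealization, and the function name
sequence_probability is hypothetical.

```python
from fractions import Fraction


def sequence_probability(sequence, p_boy=Fraction(1, 2)):
    """Probability of one exact birth order under independent Bernoulli trials."""
    prob = Fraction(1)
    for child in sequence:
        prob *= p_boy if child == "B" else 1 - p_boy
    return prob


# The four orders discussed in the text all have the same probability.
for order in ["GBGBGB", "BGBBBB", "BBBGGG", "GBBGBG"]:
    print(order, sequence_probability(order))  # each prints 1/64
```

What differs between the sequences is how typical they look, not how
probable they are.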

The evidence thus is compelling that subjects believe samples (even small
samples) should look like the population and that random samples should look
random. Other work that we will not discuss here (Nisbett & Ross, 1980)
indicates that subjects are insensitive to sample bias. In the work described
in this section, our interest was in determining whether the novice's theory
of random samples follows directly from the heuristic of representativeness or
whether it is deducible from some more basic mechanistic belief. This
distinction will become clearer if we digress for a moment and speculate how
an expert thinks about sampling.

Presumably, the expert's fundamental conception of random sampling is in
terms of a process model. Perhaps the most widely used model of random
sampling (with replacement) is to view sampling as isomorphic to the process
of drawing a labeled ball or slip of paper from an urn or box, recording the
outcome, replacing it, shaking well, and then drawing again. From this model,
the idealization of which can be characterized by algebraic expressions,
certain conclusions follow. These include "The Law of Large Numbers," which
says (roughly) that if a random sample is large enough, the relative
frequencies of outcomes in the sample have a very high probability of being a
close approximation to those in the population.

The tendency for novices to believe that even small samples are quite
representative could plausibly follow from either of two basic heuristics.
The first possibility is that the basic heuristic is representativeness, in
other words, that the way novices think about random samples is primarily
descriptive: random samples look approximately like the population and,
further, random sequences of events look "random." However, there is a second
possibility. Subjects could have an erroneous process model of sampling from
which it followed that even small samples were highly representative of the
parent population. A model that has been suggested in a number of statistics
texts (e.g., Freedman et al., 1978, Chapter 16; Hays, 1981, Chapter 1) is
"active balancing" or "compensation," an active process that guarantees that
things will "even out" in the long (and not so long) run. In the coin-tossing
example, the balancing model would suggest that following, for example, a run
of tails, the next toss is very likely to come up heads.

It is difficult to separate out these two views of sampling, since the
heuristic of active balancing could be deducible from that of
representativeness. If, in the coin example, subjects believe that samples
should look like the population of outcomes of tosses (which for a fair coin
would be idealized as half heads and half tails), samples that are close to
half heads and half tails will be most representative. If one has already
observed nine heads and is predicting the outcome of the next toss, then since
a sample of nine heads and one tail is more representative of the population
than a sample of ten heads, the outcome of "tail" on the tenth trial would be
considered to be more probable.

On what basis can one determine whether the descriptive or active
balancing heuristic is the more basic? In the IQ example mentioned earlier,
both heuristics would predict an answer of 100. However, situations exist in
which the descriptive and active balancing heuristics might lead to different
predictions. If we asked subjects to predict the mean score of the last 49
students in the sample, we might expect those who thought that all samples
should look like the population to give an answer of 100, but those who
thought in terms of an active balancing heuristic to give an answer smaller
than 100 (so that the entire sample of 50 scores could average 100). We
therefore attempted to extend the Kahneman and Tversky findings by employing
additional follow-up questions about the mean of the sample excluding the
known score. In addition, we were concerned that the interpretation of the
results of the IQ problem may have been complicated by the possibility that
subjects were simply not being very precise with numbers. For example,
subjects may simply have thought of 101 as being "approximately 100," and
therefore given the answer 100 even though they knew the mean would be
slightly higher. We therefore made sure that in the problems we used, the
difference between the correct answer and the population mean would be more
salient. We also did not depend exclusively on questionnaire data but also
conducted interviews with some subjects in which they were instructed to think
aloud while generating their answers so that we could better understand the
heuristics they were employing.

In the first study, we employed several problems that were similar in
form to the IQ problem mentioned earlier. One problem dealt with SAT scores
and read as follows:

The average SAT score for all the high school students in
a large school district is known to be 400. You have
randomly picked 10 students for a study in educational
achievement. The first student you picked had an SAT
of 250. What do you expect the average SAT to be for the
entire sample of 10?

What do you expect the average SAT to be for the next 9
students, not including the 250?

(The correct answer to the first question is 385, to the
second, 400.)
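
The three answer patterns that this problem is designed to separate can be
written out numerically. The Python sketch below is an added illustration,
not the scoring procedure used in the study. The function name
answer_patterns is hypothetical, and the balancing value assumes full
compensation (the remaining scores exactly offset the first one); the
studies themselves only required the second answer to fall on the far side
of the population mean.

```python
def answer_patterns(first_score, population_mean, sample_size):
    """Return (whole-sample mean, mean of remaining scores) for each heuristic."""
    n_rest = sample_size - 1
    mu = float(population_mean)

    # Normative answer: the remaining scores are unaffected by the first.
    correct = ((first_score + n_rest * mu) / sample_size, mu)

    # Representativeness: every sample simply looks like the population.
    representative = (mu, mu)

    # Balancing (assumed here to be full compensation): the remaining
    # scores shift so that the whole sample still averages the population mean.
    balancing = (mu, (sample_size * mu - first_score) / n_rest)

    return {
        "correct": correct,
        "representative": representative,
        "balancing": balancing,
    }


for name, pair in answer_patterns(250, 400, 10).items():
    print(name, pair)
```

Only the second question distinguishes representativeness from balancing,
since both predict 400 for the whole sample of ten.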

Problems were administered in questionnaire form to 205 students in four
undergraduate psychology statistics classes. In addition, interviews were
conducted with 31 subjects who were selected from a pool of student volunteers
and received bonus credit for their participation.

The data are displayed in Table 1. For the interviewed subjects, the
data presented are based on answers given before any interviewer intervention.
The answer predicted by representativeness, namely that the means of both
samples are equal to the population mean, was the modal response. It was
given by 33% of the subjects answering the questionnaires and by 48% of the
subjects in the interviews. Twenty-one percent gave the correct solution and
only 13% gave answers consistent with the balancing heuristic.



Insert Table 1 about here. 



In addition, 33% of the questionnaire subjects and 12% of the interview
subjects gave answers that were not consistent with the correct solution,
representativeness, or balancing. The fact that these "deviant" answers were
more likely to be found in the questionnaire data suggests that at least some
of them occurred as a result of not reading the question carefully enough,
thus misunderstanding it on a trivial level. However, one pattern (labeled
"Trend" in Table 1) deserves some comment, because it also occurred in the
interviews and has an underlying rationale. Subjects giving this pattern
thought (correctly) that the mean of the sample of ten would be lower than
400. In addition, the two means they gave were consistent in that the mean
of ten could be the average of the first observation and the average of the
next nine observations. However, their prediction for the average of the next
nine observations was also less than 400. Comments from the subjects in the
interviews who showed this pattern indicated that the divergent first score
led them to doubt that the population mean was actually 400 as stated in the
problem.

In summary, these results replicate those of Kahneman and Tversky (1972)
in that the modal estimate of the mean of the sample of ten was the population
mean. More importantly, 71% of the 95 questionnaire subjects and 71% of the
21 interview subjects who gave the population mean for the mean of the sample
of ten also gave the population mean as their best estimate of the mean of the
nine unknown scores. The percentage for each group was significantly greater
than 50%, χ²(1) = 26.5, p < .001, and χ²(1) = 3.86, p < .05, respectively. This
pattern is inconsistent with a balancing heuristic and indicates that these
subjects thought that both the sample of ten scores and the sample of nine
should be representative. Moreover, representativeness could even be the
fundamental heuristic for subjects whom we classified as "balancers." One
could claim that these subjects took the sample of ten as fundamental,
believing that it should be representative, and then decided that the estimate
they gave for the sample of nine should be consistent with their first answer.

On the other hand, it is possible that subjects who give answers consistent
with the balancing heuristic think about the problem in a fundamentally
different way.

We had hoped that detailed analyses of the interview videotapes would
provide further insights into subjects' heuristics. Unfortunately, we had
audio difficulties with the recording equipment that made evaluation of the
interviews extremely difficult. We therefore conducted a new set of
interviews using a relatively standardized set of probe questions based on an
analysis of the most informative probes used in the first study. The focus of
these more standardized interviews was to confront subjects with solutions
different from their own. We believed that the interview format would allow
us to evaluate the strength of subjects' confidence in their answers. If they
maintained their solution after being shown reasonable alternatives, one could
conclude that their original answer was not frivolous. In addition, since
subjects were given only the numerical answers for the alternative solutions
and were asked what they thought the rationale was for these solutions, their
understanding of the problem could be assessed more completely.

Interviews were conducted on 26 student volunteers who were recruited
from undergraduate psychology courses. A variation on the SAT problem
mentioned earlier was given to each subject. For half the subjects, the
problem was exactly the same as the one given previously, and for the other
half, the problem was the same except that the first student sampled was said
to have an SAT score of 550 instead of 250, so that the correct answer for the
estimate of the mean of the sample of ten scores was now 415.

The subject read the first part of the problem, which asked for the best
estimate of the mean of the sample of ten scores, and answered it, being
encouraged to think out loud as much as possible. After the subject gave an
answer, the interviewer asked for the subject's best estimate of the mean of
the nine unknown scores. Until the second answer was given, the interviewer
did not intervene except to clarify parts of the problem upon request, to
correct the subject if he or she misread the question, and to encourage the
subject to think out loud. The subject's answers (assuming the first score
was 250) were classified by the interviewer as (1) demonstrating the correct
rationale (if the answers to the questions were less than 400 and 400,
respectively); (2) demonstrating representativeness (if both answers were
400); or (3) demonstrating balancing (if the answers were 400 and greater than
400).

The interviewer then told the subject that the problem had been given to
many other students and that he was going to present some of their answers.
The subject was then presented with one of the two patterns of answers that he
or she had not given and asked to comment on it. For example, if the
subject's answers had been classified as "representative," the
interviewer might then say that some people had given a pattern of responses
in which the best estimate of the mean of the sample of ten scores was less
than 400 and the estimate of the mean of the nine unknown scores was 400
(i.e., the correct solution). The subject was asked if he or she could figure
out the possible rationale for such answers and then what he or she thought of
this approach. In the next part of the interview the subject would be
presented with numerical answers consistent with the balancing solution and
the same series of questions would ensue. Following this, the subject would
be asked explicitly what he or she thought the best answers were. (The
suggestion that subjects might want to reconsider their answers is, of course,
implicit in presenting alternative answers.) The order of presentation of the
alternative patterns of answers was appropriately counterbalanced over
subjects. The correct answer was never identified as such.
The results were very similar to those of the first study (see Table 2).
Before subjects were presented with the alternative solutions, the modal
response was again representative (56%), while 20% chose the correct solution
and only 12% responded with a pattern consistent with the balancing heuristic.



Insert Table 2 about here 

The most striking aspect of the data is that the pattern of results at
the end of the interview, after alternative solutions had been presented,
differed very little from those obtained before interviewer intervention. Of
the 23 subjects of interest (one subject terminated the interview prematurely
and the initial answers of two others were not classifiable), only four
changed their answers as a result of considering the alternative solutions.
We can conclude that the representative answer is not merely a hasty response
to the problem, since when presented with the correct and the balancing
solutions, 12 of the 14 subjects maintained their representative answer.

Although we do not have the space to go into any detail about the
verbalizations of the subjects, we will summarize a few points. In giving
their own initial answers, only two subjects gave what could be construed as a
balancing rationale, saying that there were usually as many scores above the
mean as below and that there should be a higher score to "compensate" for a
lower one. Also of interest was the possibility that subjects may not have
considered the implications of sampling from a large population and may
consequently have been concerned about sampling without replacement. However,
only four subjects indicated that they had considered implications of the fact
that sampling was done without replacement and in only one case did this seem
to lead to an eventual balancing solution. All but one of the
representativeness subjects, when asked whether both means should be 400 if we
were dealing with actual scores, clearly understood they could not, but
indicated it was reasonable in this case since the problem asked for the means
of two hypothetical samples. Many subjects were uncomfortable about giving a
point estimate, indicating that the variability and uncertainty inherent in
sampling was very much on their minds. The point here is not that it is a
misconception to be aware of the variability associated with the sample mean,
but that while for experts a point estimate and the variability associated
with the estimate are separable concepts, novices have difficulty making this
differentiation. Finally, we can conclude from the interview protocols that
subjects understood the alternative solutions presented to them reasonably
well, and were usually capable of indicating the rationales that would lead to
the patterns of answers.

In summary, the data indicate that for most subjects the belief that the
population mean is the best estimate for both sample means is deeply held.
They continue to believe that answer even after being presented with
alternative solutions, and in spite of the fact that they show reasonably good
understanding of the rationales underlying these solutions. Moreover,
detailed analyses of the interview protocols revealed little evidence of
balancing imagery. The data further suggest that subjects consider the
representativeness answer to be reasonable because they regard estimates about
the means of random samples differently than those about the means of samples
consisting of known scores, and frequently feel quite uneasy about estimating
the mean of a random sample.

Insensitivity to Sample Size
Kahneman and Tversky (1972) showed that people can be quite insensitive
to the role of sample size in determining the extent to which properties of
random samples are similar to those of the parent population. In a typical
demonstration of this insensitivity, they presented novices with the following
problem:

A certain town is served by two hospitals. In the larger
hospital about 45 babies are born each day, and in the
smaller hospital about 15 babies are born each day. As
you know, about 50 percent of all babies are boys.
However, the exact percentage varies from day to day.
Sometimes it might be higher than 50 percent, sometimes
lower.

For a period of 1 year, each hospital recorded the
days on which more than 60 percent of the babies born were
boys. Which hospital do you think had more such days?
The larger hospital
The smaller hospital
About the same (that is, within 5 percent of one
another)

Most subjects thought that the two hospitals would have about an equal
number of days with 60% male births, and about as many thought the larger
hospital would have more such days as thought the smaller hospital would.
(The correct answer, of course, is the smaller hospital.)
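
The size of the difference between the two hospitals can be computed exactly
from the binomial distribution. The sketch below is an added illustration
under simplifying assumptions that go beyond the problem statement: exactly
15 or 45 births every day, independent births, and a probability of exactly
one half for a boy. The function name is hypothetical.

```python
from math import comb


def prob_more_than_60_percent_boys(n_births):
    """Exact probability that more than 60% of n_births babies are boys.

    Assumes independent births with P(boy) = 1/2. The test 5 * k > 3 * n
    is the integer form of k / n > 0.6 and avoids floating-point rounding.
    """
    favorable = sum(
        comb(n_births, k)
        for k in range(n_births + 1)
        if 5 * k > 3 * n_births
    )
    return favorable / 2 ** n_births


small = prob_more_than_60_percent_boys(15)
large = prob_more_than_60_percent_boys(45)
print(f"smaller hospital: {small:.3f}")
print(f"larger hospital:  {large:.3f}")
```

Under these assumptions the smaller hospital records such a day roughly
twice as often as the larger one, because the proportion of boys varies more
from day to day when fewer babies are born.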

Kahneman and Tversky also conducted a series of studies in which they had
subjects produce subjective sampling distributions for three sample sizes.
For example, they told different groups of subjects that approximately N
(where N could be 10, 100, or 1000) babies are born each day in a certain
region.

For N = 1000, the question read:

On what percentage of days will the number of boys among
the 1000 babies be as follows:

Up to 50 boys
50 to 150 boys
150 to 250 boys
...
850 to 950 boys
More than 950 boys

Note that the categories include all possibilities, so
your answers should add up to about 100%.

For N = 100, the 11 categories were as follows: Up to 5, 5-15, 15-25, etc.
For N = 10, the categories were 0, 1, 2, etc.

Although the correct plot of percentage of days versus category would
drop off from its peak value much more rapidly with increasing sample size,
sample size had no effect whatever on the subjective sampling distributions.
In other words, the distributions given by the subjects were about the same
when N was equal to 10, 100, or 1000.
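
The correct distributions over the 11 categories can be computed from the
binomial distribution, which makes the contrast with the subjects' flat
response concrete. The sketch below is an added illustration. It assumes
independent births with a probability of one half for a boy, and it resolves
the overlapping category boundaries in the question (for example, whether 50
boys falls in "Up to 50" or "50 to 150") by assigning each boundary value to
the lower category; that choice and both function names are assumptions made
here.

```python
from fractions import Fraction
from math import comb


def category_bins(n):
    """Inclusive (low, high) bounds for the 11 response categories."""
    if n == 10:
        return [(k, k) for k in range(11)]
    step, half = n // 10, n // 20
    lowers = [0] + [half + 1 + i * step for i in range(10)]
    uppers = [half + i * step for i in range(10)] + [n]
    return list(zip(lowers, uppers))


def category_probabilities(n):
    """Exact probability of each category for n births with P(boy) = 1/2."""
    total = 2 ** n
    return [
        Fraction(sum(comb(n, k) for k in range(lo, hi + 1)), total)
        for lo, hi in category_bins(n)
    ]


for n in (10, 100, 1000):
    probs = category_probabilities(n)
    # Index 5 is the middle category, where the distribution peaks.
    print(n, f"middle category: {float(probs[5]):.3f}")
```

As N grows, nearly all of the probability collects in the middle category,
whereas the subjects' distributions stayed about the same for all three
values of N.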

Kahneman and Tversky (1972) accounted for this insensitivity to sample
size by hypothesizing that subjects judged the probability of a sample by its
representativeness, that is, by the extent to which the sample is similar in
its essential characteristics to the parent population. As about 50% of the
population of newborns are male, a strict application of the
representativeness heuristic would suggest that the probability of a sample
depends on the similarity of the proportion of males in that sample to 50%.
Since sample size is not a characteristic of the population, by this account
it would not influence the judgment of probability. They concluded that "the
notion that sampling variance decreases in proportion to sample size is
apparently not part of man's repertoire of intuitions" (p. 44). They further
implied that the lack of this intuition could explain other misconceptions
about sample size, e.g., "...people often remain skeptical in the face of
solid evidence from a large sample, as in the case of the well-known
politician who complained bitterly that the cost-of-living index is not based
on the whole population, but only on a large sample, and added, 'worse yet--a
random sample'" (p. 44).

On the other hand, it seems hard to believe that people are totally
insensitive to sample size. We have found students to be much more
comfortable with results when they are obtained from larger samples. In fact,
they seem to distrust any result obtained from a small sample.

Bar-Hillel (1979, 1980, 1982) was able to find a number of situations in
which subjects judged larger samples to be more representative than smaller
ones. For example, she found that 80% of her subjects chose the larger sample
when she asked them which of two sets of estimates of the percentage of voters
who intended to vote yes on a certain referendum they had most confidence in:
those of Firm A, who surveyed a sample of 400 individuals, or those of Firm B,
who surveyed a sample of 1000 individuals.

More interestingly, she found that it is not sample size per se that has
an effect on confidence, but rather relative size, or the ratio of the size of
the sample to the size of the population. When several samples are drawn from
the same population, absolute and relative sample size are linearly related.
However, when population size as well as sample size is varied, the effects of
absolute and relative sample size can be discriminated. Bar-Hillel (1979)
used problems of the following type:

Two pollsters are conducting surveys to estimate the
proportion of voters in their respective cities who intend
to vote yes on a certain referendum.

Firm A operates in a city of 1 million voters.

Firm B operates in a city of 50,000 voters.
Both firms are sampling one out of 1,000 voters.
Whose estimate would you be more confident in accepting?

She found that although Firm A has a sample of 1000 and Firm B has a
sample of only 50, the percentage of subjects who expressed more confidence in
the larger sample was only 50%, compared to 29% who showed equal confidence in
both samples. When another group of subjects were told not that both firms
sampled 1 in every 1000 people, but rather that both firms sampled 1000
people, 62% expressed more confidence in the sample that came from the smaller
city. This strongly suggests that subjects were considering the ratio of
sample size to population size rather than absolute sample size. In fact,
when the population is moderately large with respect to the sample, it is
almost exclusively the absolute rather than the relative sample size that
determines sampling variability.
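
The claim that absolute sample size dominates can be illustrated with the
standard error of a sample proportion under simple random sampling without
replacement, which includes the finite population correction. The sketch
below is an added illustration; it assumes a true proportion of 0.5 (the
value that gives the largest standard error), and the function name
standard_error is hypothetical.

```python
from math import sqrt


def standard_error(sample_size, population_size, p=0.5):
    """Standard error of a sample proportion, sampling without replacement.

    The factor (N - n) / (N - 1) is the finite population correction.
    p = 0.5 is an assumed true proportion, not a value from the study.
    """
    fpc = (population_size - sample_size) / (population_size - 1)
    return sqrt(p * (1 - p) / sample_size * fpc)


# Both firms sample 1 in 1,000 voters: Firm A gets 1000, Firm B gets 50.
print(f"Firm A, n=1000 of 1,000,000: {standard_error(1000, 1_000_000):.4f}")
print(f"Firm B, n=50 of 50,000:      {standard_error(50, 50_000):.4f}")

# Both firms sample 1000 voters: the city size hardly matters.
print(f"n=1000 of 50,000:            {standard_error(1000, 50_000):.4f}")
```

With equal sampling ratios, the firm with the larger sample is far more
precise. With equal sample sizes, the two cities give nearly the same
precision, so the preference for the smaller city has little normative
support.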

It is probably this predisposition to respond to the ratio of the size of
the sample to that of the population that can explain some of the skepticism
of our aforementioned politician, as well as that with which lay audiences
seem to treat the results of pre-election polls based on sample sizes of
several thousand.

In recognition of the fact that under some conditions sample size is not
ignored by novices, Bar-Hillel (1982) introduced the notion of a secondary
sense of representativeness, which referred to the procedures by which a sample
was selected rather than to the subsequent characteristics of the sample.
A sample would be more representative, in this secondary sense, if it was
large. She found that subjects were more sensitive to sample size in the
hospital problem if they were asked about a sample of 80% or 100% male births
rather than 60% and suggested that the use of representativeness in the
secondary sense might be triggered by sufficiently discrepant samples.

Although Bar-Hillel's distinction is logical enough, it does not allow
us to predict the conditions under which people are sensitive to sample size.
What seems to be required at this point, before we can profitably speculate
further about different intuitions and heuristics, is clarification of those
conditions.

We have attempted to investigate this issue using a variety of problems
such as the following:

When they turn 18, American males must register at a
local post office. In addition to other information, the
height of each male is obtained. The national average
height of 18-year-old males is 5 feet, 9 inches.

Every day for one year, 23 men registered at post
office A and 100 men registered at post office B. At the
end of each day, a clerk at each post office computed and
recorded the average height of the men who registered
there that day.

Which would you expect to be true?

Version A:

(1) The average height at post office A was closer to the
national average than was the average height at post
office B.

(2) The average height at post office B was closer to the
national average than was the average height at post
office A.

(3) There is no reason to expect that the average height
was closer to the national average at one post office than
the other.

Version C:

(1) The number of days on which the average height was
between 5 feet, 6 inches and 6 feet was greater for post
office A than for post office B.

(2) The number of days on which the average height was
between 5 feet, 6 inches and 6 feet was greater for post
office B than for post office A.

(3) There is no reason to expect that the number of days
on which the average height was between 5 feet, 6 inches
and 6 feet was greater for one post office than the other.

Version T:

(1) The number of days on which the average height was 6
feet or more was greater for post office A than for post
office B.

(2) The number of days on which the average height was 6
feet or more was greater for post office B than for post
office A.

(3) There is no reason to think that the number of days
on which the average height was 6 feet or more was greater
for one post office than the other.

The data from a sample of undergraduates who had not yet taken a
statistics course are displayed in Table 3. For Version A, performance was
reasonably good. Fifty-six percent of the subjects thought that the average
height recorded at the larger post office would be closer to the national
average and only 4% selected the smaller post office. When in Version C they
were asked, in effect, whether there would be more days in which the average
height recorded was within 3 inches of the national average at one post office
or the other, performance was similar. Fifty-nine percent chose the larger
post office and none chose the smaller one. However, when they were asked
which post office would record more days with an average over 6 feet (3 inches
more than the national average), the percentage of correct responses was
significantly lower than for Version A, χ²(1) = 13.6, p < .001, or Version C,
χ²(1) = 13.9, p < .001. Only about 8% of subjects correctly picked the smaller
post office as being more likely to have a discrepant average, while 25%
picked the larger post office.



Insert Table 3 about here. 



The fact that performance was so much poorer for Version T than Version C
is striking. In the latter, subjects are asked about the central portion of
the sampling distribution, and in the former, they are asked about the tail of
the distribution. One might logically think the knowledge that the average
height recorded is more likely to be near the national average for the larger
sample would translate into the knowledge that the average recorded is less
likely to be near the national average for the smaller sample, but quite
clearly, this is not the case.
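
The link between the central portion and the tail can be made explicit with a
normal approximation. This is a minimal sketch under stated assumptions, not
an analysis from the study: heights are taken to be normally distributed with
a standard deviation of 3 inches (an assumed value that the problem does not
supply), and the daily sample sizes are the ones printed in the problem. All
names are illustrative.

```python
import math

MU = 69.0     # national average height in inches (5 feet, 9 inches)
SIGMA = 3.0   # ASSUMED standard deviation of individual heights, in inches


def upper_tail(z):
    """P(Z >= z) for a standard normal variable; erfc stays accurate far out."""
    return 0.5 * math.erfc(z / math.sqrt(2))


def prob_mean_at_least(n, cutoff=72.0):
    """Version T: chance that a day's average is 6 feet (72 inches) or more."""
    standard_error = SIGMA / math.sqrt(n)
    return upper_tail((cutoff - MU) / standard_error)


def prob_mean_within(n, half_width=3.0):
    """Version C: chance that a day's average is within 3 inches of MU."""
    standard_error = SIGMA / math.sqrt(n)
    return 1.0 - 2.0 * upper_tail(half_width / standard_error)


for n in (23, 100):
    print(n, prob_mean_within(n), prob_mean_at_least(n))
```

Both quantities follow from the same standard error: whatever probability the
larger sample gains in the central region is exactly the probability it loses
in the two tails.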

Although we do not fully understand the reasoning of our subjects, these
results, and those obtained from interviews with subjects attempting to deal
with problems like the ones described above, have led us to believe that most
novices do believe that larger samples are better than smaller ones and will
correctly answer problems that directly ask which of the samples is "better"
or can be easily translated into those terms. In situations in which absolute
and relative sample size can be distinguished, subjects will be
influenced by the latter. Most subjects will not, however, be able to make
the inferential step necessary to conclude that of two equally discrepant
samples, the larger is less likely than the smaller. We believe that for some
subjects, wrong answers follow from certain misconceptions they have about
discrepant samples, for example, that a large sample is more likely to contain
an extreme score and hence have a discrepant mean. This would explain why
subjects perform so much better in the hospital problem when the discrepant
sample is said to consist of 100% boys rather than 60%. However, we feel that
much of the difficulty is encountered when subjects have to deal implicitly
with the notion of the sampling distribution in order to answer the problem.
In the post office problem, for example, it is very easy for subjects to
confuse the appropriate sampling distribution, namely, the distribution of the
average height recorded on a day, with the distribution of heights
recorded on a day, which is really a very different concept.
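
The difference between the two distributions can be illustrated with a small
simulation. This is a sketch under assumptions of our own choosing, not a
reanalysis of any data: heights are drawn from a normal distribution with mean
69 inches and an assumed standard deviation of 3 inches, and each post office
is simulated for 365 days. The names used are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

MU = 69.0     # national average height in inches (5 feet, 9 inches)
SIGMA = 3.0   # ASSUMED standard deviation of individual heights
DAYS = 365


def simulate_office(men_per_day):
    """Return (all individual heights, list of daily average heights)."""
    heights, daily_means = [], []
    for _ in range(DAYS):
        day = [random.gauss(MU, SIGMA) for _ in range(men_per_day)]
        heights.extend(day)
        daily_means.append(statistics.fmean(day))
    return heights, daily_means


heights_a, means_a = simulate_office(23)
heights_b, means_b = simulate_office(100)

# Spread of individual heights: about the same at both offices.
sd_heights_a = statistics.pstdev(heights_a)
sd_heights_b = statistics.pstdev(heights_b)

# Spread of daily averages: much smaller, and smallest for the larger office.
sd_means_a = statistics.pstdev(means_a)
sd_means_b = statistics.pstdev(means_b)

print(f"SD of heights:        A = {sd_heights_a:.2f}, B = {sd_heights_b:.2f}")
print(f"SD of daily averages: A = {sd_means_a:.2f}, B = {sd_means_b:.2f}")
```

A subject who answers in terms of the heights of individual men has no reason
to distinguish the two offices, whereas the daily averages differ markedly in
their variability.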

Concluding Comments

The results discussed in the preceding sections have some pedagogical
implications. Many textbooks in statistics that discuss the Law of Large
Numbers attempt to dispel students' belief in the gambler's fallacy. However,
they assume that the basic misconception students have is active balancing,
and they oppose this mechanism with the notion of "swamping" in which the
large amount of subsequent data overwhelms the impact of an initial discrepant
score on the mean (e.g., Hays, 1981). Our own attempts to teach this
conceptualization have not been very successful. Our research suggests that
such an approach is likely to be unfruitful because the problem is not that
students think in terms of an incorrect process mechanism but that they do not
think of random sampling in terms of any process model. To refute active
balancing is to refute a belief that most students do not have and this may
confuse them. Since the most common heuristic, representativeness, is so
different in form from the appropriate process model, it will not be easy to
set up an appropriate confrontation between the two systems to effect a
lasting change in students' beliefs about random samples unless increased
emphasis is placed on instilling a process view of sampling.
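
The contrast between "swamping" and active balancing can be stated
arithmetically. The sketch below is only an illustration: the population mean
of 400 matches the values in Table 1, but the initial discrepant score of 250
is a hypothetical value chosen for the example, and the function names are our
own.

```python
def swamped_mean(first_score, population_mean, n):
    """Expected mean of n scores when the first is known and the remaining
    n - 1 are independent draws whose expected value is the population mean."""
    return (first_score + (n - 1) * population_mean) / n


def balancing_mean_of_rest(first_score, population_mean, n):
    """Mean the remaining n - 1 scores would need in order to pull the
    sample mean back to the population mean (the active balancing belief)."""
    return (n * population_mean - first_score) / (n - 1)


FIRST_SCORE = 250       # HYPOTHETICAL initial discrepant score
POPULATION_MEAN = 400

for n in (10, 100, 1000):
    print(n, swamped_mean(FIRST_SCORE, POPULATION_MEAN, n))

print(balancing_mean_of_rest(FIRST_SCORE, POPULATION_MEAN, 10))
```

Under swamping, the remaining scores are expected to average 400 and the
discrepant score is simply diluted as n grows. Under balancing, the remaining
scores would have to average above 400 to compensate.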

Also, given the work done on sensitivity to sample size, it is
increasingly clear that basic concepts and principles must be illustrated
with a variety of examples if students are to be able to generalize them
appropriately. The results presented above show that subjects can understand
a basic principle at one level (i.e., that larger samples are more
representative than smaller ones), but fail to make judgments that seem to
follow directly from it. Confronting students with their answers to problems
like the ones we have discussed also seems to have the potential for making
them think more appropriately about sampling distributions and the
implications of sample size for sampling variability.

Table 1

         Solution Type
Mean of          Mean of
10 scores        9 scores         Label              Questionnaires   Interviews

Less than 400    400              Correct Solution   44(21%)          6(19%)
400              400              Representative     68(33%)          15(48%)
400              More than 400    Balancing          25(12%)          6(19%)
Less than 400*   Less than 400*   Trend              18(9%)           2(6%)
                                  Unclassified       50(24%)          2(6%)

Totals                                               205              31

*For the trend solution, mean of 10 scores < mean of 9 scores < 400.



Table 2

Frequency of Solution Types, Study 2

                                         Solution Type
Position in
interview             Correct   Representative   Balancing   Trend   Unclassified

Final answer
before alternative
solutions were
presented             5(20%)    14(56%)          3(12%)      1(4%)   2(8%)

Answer at end of
interview             4(16%)    12(48%)          7(28%)      0





Table 3

Frequency of Solution Types in Sample Size Study

                        Solution Type
Version
of problem     Correct     Reverse     Same

A              42(56%)     3(4%)       30(40%)
C              23(59%)     0
T              3(8.3%)     9(25%)      24(66.7%)


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